Overview

Dataset statistics

Number of variables: 9
Number of observations: 26648
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 1.0 MiB
Average record size in memory: 40.0 B
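
The total and per-record memory figures can be cross-checked against the per-variable memory sizes listed under Variables. The sketch below only re-adds the rounded KiB values printed in this report, so small rounding differences are expected; the dictionary name `memory_kib` is introduced here for illustration.

```python
# Per-variable memory sizes as printed in the Variables section, in KiB.
memory_kib = {
    "df_index": 208.2,
    "flight_path": 26.0,
    "airline": 26.0,
    "departure_time_day": 26.0,
    "booking_day": 208.2,
    "departure_day": 104.1,
    "departure_time": 26.0,
    "flight_cost": 208.2,
    "number_of_stops": 208.2,
}
n_observations = 26648

total_kib = sum(memory_kib.values())
total_mib = total_kib / 1024
avg_record_bytes = total_kib * 1024 / n_observations

print(f"total: {total_mib:.1f} MiB")
print(f"average record: {avg_record_bytes:.1f} B")
```

The three distinct sizes work out to roughly 8, 4 and 1 bytes per value, which would be consistent with 64-bit, 32-bit and 8-bit storage; that reading of the column types is an inference, not something the report states.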

Variable types

NUM: 8
CAT: 1

Warnings

departure_time is highly correlated with departure_time_day [High correlation]
departure_time_day is highly correlated with departure_time [High correlation]
flight_path has 5780 (21.7%) zeros [Zeros]
airline has 3897 (14.6%) zeros [Zeros]
departure_time_day has 1934 (7.3%) zeros [Zeros]
booking_day has 3776 (14.2%) zeros [Zeros]
departure_day has 6957 (26.1%) zeros [Zeros]
number_of_stops has 5301 (19.9%) zeros [Zeros]
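
The percentages in the zero warnings are the zero counts divided by the 26648 observations. A minimal sketch that recomputes them from the counts printed above (the variable names are illustrative):

```python
n_observations = 26648

# Zero counts as listed in the warnings.
zero_counts = {
    "flight_path": 5780,
    "airline": 3897,
    "departure_time_day": 1934,
    "booking_day": 3776,
    "departure_day": 6957,
    "number_of_stops": 5301,
}

# Share of zeros per variable, in percent, rounded to one decimal place.
zero_pct = {
    name: round(100 * count / n_observations, 1)
    for name, count in zero_counts.items()
}

for name, pct in zero_pct.items():
    print(f"{name}: {pct}%")
```

Because these columns appear to hold integer codes, a zero is probably an ordinary category value rather than a sign of missing data; that interpretation is an assumption based on the value tables.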

Reproduction

Analysis started: 2020-10-07 12:55:39.234692
Analysis finished: 2020-10-07 12:56:19.759433
Duration: 40.52 seconds
Software version: pandas-profiling v2.9.0
Download configuration: config.yaml
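
The reported duration is the difference between the two timestamps. A short check using only the standard library:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

started = datetime.strptime("2020-10-07 12:55:39.234692", FMT)
finished = datetime.strptime("2020-10-07 12:56:19.759433", FMT)

# Elapsed wall-clock time of the analysis, in seconds.
duration = (finished - started).total_seconds()
print(f"{duration:.2f} seconds")
```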

Variables

df_index
Real number (ℝ≥0)

Distinct: 5730
Distinct (%): 21.5%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2387.683166
Minimum: 0
Maximum: 5783
Zeros: 7
Zeros (%): < 0.1%
Memory size: 208.2 KiB

Quantile statistics

Minimum: 0
5th percentile: 230
Q1: 1147
Median: 2284.5
Q3: 3431
95th percentile: 5120
Maximum: 5783
Range: 5783
Interquartile range (IQR): 2284

Descriptive statistics

Standard deviation: 1487.008623
Coefficient of variation (CV): 0.622783058
Kurtosis: -0.8135551969
Mean: 2387.683166
Median Absolute Deviation (MAD): 1141.5
Skewness: 0.3230022456
Sum: 63626981
Variance: 2211194.646
Monotonicity: Not monotonic
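
Several of these statistics follow from one another: the mean is the sum over the number of observations, the coefficient of variation is the standard deviation over the mean, the variance is the squared standard deviation, and the range and IQR are differences of quantiles. A minimal sketch that recomputes them from the df_index figures printed above (inputs are rounded as shown in the report, so the results agree only to the displayed precision):

```python
# Values copied from the df_index statistics in this report.
n_observations = 26648
total = 63626981
std = 1487.008623
q1, q3 = 1147, 3431
minimum, maximum = 0, 5783

mean = total / n_observations     # Mean
cv = std / mean                   # Coefficient of variation (CV)
variance = std ** 2               # Variance
iqr = q3 - q1                     # Interquartile range (IQR)
value_range = maximum - minimum   # Range

print(f"mean={mean:.6f} cv={cv:.9f} variance={variance:.3f}")
print(f"iqr={iqr} range={value_range}")
```

The same identities apply to the other numeric variables in this report.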
Histogram with fixed size bins (bins=50)

Value                Count  Frequency (%)
2023                     7  < 0.1%
3321                     7  < 0.1%
2552                     7  < 0.1%
505                      7  < 0.1%
537                      7  < 0.1%
2616                     7  < 0.1%
2632                     7  < 0.1%
585                      7  < 0.1%
601                      7  < 0.1%
713                      7  < 0.1%
Other values (5720)  26578  99.7%

Value  Count  Frequency (%)
0          7  < 0.1%
1          7  < 0.1%
2          7  < 0.1%
3          7  < 0.1%
4          7  < 0.1%

Value  Count  Frequency (%)
5783       1  < 0.1%
5782       1  < 0.1%
5781       1  < 0.1%
5780       1  < 0.1%
5779       1  < 0.1%

flight_path
Real number (ℝ≥0)

ZEROS

Distinct: 7
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.888622035
Minimum: 0
Maximum: 6
Zeros: 5780
Zeros (%): 21.7%
Memory size: 26.0 KiB

Quantile statistics

Minimum: 0
5th percentile: 0
Q1: 1
Median: 3
Q3: 5
95th percentile: 6
Maximum: 6
Range: 6
Interquartile range (IQR): 4

Descriptive statistics

Standard deviation: 2.200323654
Coefficient of variation (CV): 0.7617208576
Kurtosis: -1.450638224
Mean: 2.888622035
Median Absolute Deviation (MAD): 2
Skewness: 0.03060599217
Sum: 76976
Variance: 4.841424183
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=7)

Value  Count  Frequency (%)
0       5780  21.7%
3       5184  19.5%
6       4535  17.0%
1       4360  16.4%
5       4092  15.4%
4       2000  7.5%
2        697  2.6%

Value  Count  Frequency (%)
0       5780  21.7%
1       4360  16.4%
2        697  2.6%
3       5184  19.5%
4       2000  7.5%

Value  Count  Frequency (%)
6       4535  17.0%
5       4092  15.4%
4       2000  7.5%
3       5184  19.5%
2        697  2.6%

airline
Real number (ℝ≥0)

ZEROS

Distinct: 6
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.881829781
Minimum: 0
Maximum: 5
Zeros: 3897
Zeros (%): 14.6%
Memory size: 26.0 KiB

Quantile statistics

Minimum: 0
5th percentile: 0
Q1: 2
Median: 3
Q3: 4
95th percentile: 5
Maximum: 5
Range: 5
Interquartile range (IQR): 2

Descriptive statistics

Standard deviation: 1.542275041
Coefficient of variation (CV): 0.5351721503
Kurtosis: -0.5175576667
Mean: 2.881829781
Median Absolute Deviation (MAD): 1
Skewness: -0.6340597304
Sum: 76795
Variance: 2.378612301
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=6)

Value  Count  Frequency (%)
3      10194  38.3%
4       5831  21.9%
0       3897  14.6%
5       3703  13.9%
1       1672  6.3%
2       1351  5.1%

Value  Count  Frequency (%)
0       3897  14.6%
1       1672  6.3%
2       1351  5.1%
3      10194  38.3%
4       5831  21.9%

Value  Count  Frequency (%)
5       3703  13.9%
4       5831  21.9%
3      10194  38.3%
2       1351  5.1%
1       1672  6.3%

departure_time_day
Real number (ℝ≥0)

HIGH CORRELATION
ZEROS

Distinct: 28
Distinct (%): 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 11.71206094
Minimum: 0
Maximum: 27
Zeros: 1934
Zeros (%): 7.3%
Memory size: 26.0 KiB

Quantile statistics

Minimum: 0
5th percentile: 0
Q1: 6
Median: 14
Q3: 17
95th percentile: 20
Maximum: 27
Range: 27
Interquartile range (IQR): 11

Descriptive statistics

Standard deviation: 6.630470153
Coefficient of variation (CV): 0.5661232626
Kurtosis: -1.107932071
Mean: 11.71206094
Median Absolute Deviation (MAD): 5
Skewness: -0.4045555234
Sum: 312103
Variance: 43.96313445
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=28)

Value              Count  Frequency (%)
14                  3903  14.6%
20                  1935  7.3%
0                   1934  7.3%
18                  1884  7.1%
19                  1828  6.9%
16                  1799  6.8%
17                  1686  6.3%
15                  1533  5.8%
2                   1069  4.0%
7                   1009  3.8%
Other values (18)   8068  30.3%

Value  Count  Frequency (%)
0       1934  7.3%
1        769  2.9%
2       1069  4.0%
3        896  3.4%
4        962  3.6%

Value  Count  Frequency (%)
27        52  0.2%
26        50  0.2%
25        49  0.2%
24        46  0.2%
23        57  0.2%

booking_day
Real number (ℝ≥0)

ZEROS

Distinct: 7
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.090138097
Minimum: 0
Maximum: 6
Zeros: 3776
Zeros (%): 14.2%
Memory size: 208.2 KiB

Quantile statistics

Minimum: 0
5th percentile: 0
Q1: 1
Median: 3
Q3: 5
95th percentile: 6
Maximum: 6
Range: 6
Interquartile range (IQR): 4

Descriptive statistics

Standard deviation: 2.013497147
Coefficient of variation (CV): 0.6515880793
Kurtosis: -1.256603876
Mean: 3.090138097
Median Absolute Deviation (MAD): 2
Skewness: -0.07836505984
Sum: 82346
Variance: 4.054170762
Monotonicity: Increasing
Histogram with fixed size bins (bins=7)

Value  Count  Frequency (%)
6       4054  15.2%
4       4044  15.2%
5       4037  15.1%
0       3776  14.2%
3       3705  13.9%
1       3518  13.2%
2       3514  13.2%

Value  Count  Frequency (%)
0       3776  14.2%
1       3518  13.2%
2       3514  13.2%
3       3705  13.9%
4       4044  15.2%

Value  Count  Frequency (%)
6       4054  15.2%
5       4037  15.1%
4       4044  15.2%
3       3705  13.9%
2       3514  13.2%

departure_day
Real number (ℝ≥0)

ZEROS

Distinct: 7
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.650517863
Minimum: 0
Maximum: 6
Zeros: 6957
Zeros (%): 26.1%
Memory size: 104.1 KiB

Quantile statistics

Minimum: 0
5th percentile: 0
Q1: 0
Median: 3
Q3: 5
95th percentile: 6
Maximum: 6
Range: 6
Interquartile range (IQR): 5

Descriptive statistics

Standard deviation: 2.145478968
Coefficient of variation (CV): 0.8094565212
Kurtosis: -1.3641934
Mean: 2.650517863
Median Absolute Deviation (MAD): 2
Skewness: 0.1575318788
Sum: 70631
Variance: 4.603080004
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=7)

Value  Count  Frequency (%)
0       6957  26.1%
6       3553  13.3%
2       3454  13.0%
4       3445  12.9%
5       3280  12.3%
3       3133  11.8%
1       2826  10.6%

Value  Count  Frequency (%)
0       6957  26.1%
1       2826  10.6%
2       3454  13.0%
3       3133  11.8%
4       3445  12.9%

Value  Count  Frequency (%)
6       3553  13.3%
5       3280  12.3%
4       3445  12.9%
3       3133  11.8%
2       3454  13.0%

departure_time
Categorical

HIGH CORRELATION

Distinct: 4
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 26.0 KiB

Value  Count  Frequency (%)
2      14568  54.7%
0       7556  28.4%
1       4106  15.4%
3        418  1.6%

Frequencies of value counts

Unique

Unique: 0
Unique (%): 0.0%

Histogram of lengths of the category

Length

Max length: 1
Median length: 1
Mean length: 1
Min length: 1
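
The high-correlation warning between departure_time and departure_time_day has a plausible explanation in the counts themselves: they are consistent with departure_time being departure_time_day integer-divided by 7. This mapping is an inference from the numbers in this report, not something the report states, and `hypothetical_bucket` is a name introduced here for illustration. Only the bucket with value 2 can be checked in full, because the other departure_time_day counts are partly hidden under "Other values".

```python
# Counts copied from the report's value tables.
departure_time_counts = {2: 14568, 0: 7556, 1: 4106, 3: 418}
day_counts_14_to_20 = {
    14: 3903, 15: 1533, 16: 1799, 17: 1686,
    18: 1884, 19: 1828, 20: 1935,
}

# (departure_time_day, departure_time) pairs that occur in the Sample rows.
sample_pairs = [(14, 2), (21, 3), (7, 1), (0, 0)]


def hypothetical_bucket(day):
    """Assumed mapping (not stated in the report): integer division by 7."""
    return day // 7


pairs_agree = all(hypothetical_bucket(day) == time for day, time in sample_pairs)
bucket_2_total = sum(
    count for day, count in day_counts_14_to_20.items()
    if hypothetical_bucket(day) == 2
)

print(pairs_agree)
print(bucket_2_total, departure_time_counts[2])
```

If the mapping holds, one of the two variables is redundant for modelling purposes; confirming it would require the underlying data.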

flight_cost
Real number (ℝ≥0)

Distinct: 1060
Distinct (%): 4.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 5151.78974
Minimum: 2540
Maximum: 20674
Zeros: 0
Zeros (%): 0.0%
Memory size: 208.2 KiB

Quantile statistics

Minimum: 2540
5th percentile: 3323
Q1: 3797
Median: 4680
Q3: 5910
95th percentile: 8731
Maximum: 20674
Range: 18134
Interquartile range (IQR): 2113

Descriptive statistics

Standard deviation: 1869.544629
Coefficient of variation (CV): 0.3628922613
Kurtosis: 5.198739453
Mean: 5151.78974
Median Absolute Deviation (MAD): 883
Skewness: 1.906065332
Sum: 137284893
Variance: 3495197.118
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)

Value                Count  Frequency (%)
3797                  1770  6.6%
3746                  1433  5.4%
3955                  1232  4.6%
3481                  1119  4.2%
3271                   890  3.3%
5132                   872  3.3%
4006                   715  2.7%
4712                   635  2.4%
3956                   560  2.1%
4922                   424  1.6%
Other values (1050)  16998  63.8%

Value  Count  Frequency (%)
2540       3  < 0.1%
2955      53  0.2%
2956      64  0.2%
2957     202  0.8%
3061       7  < 0.1%

Value  Count  Frequency (%)
20674      4  < 0.1%
20666      1  < 0.1%
18866      1  < 0.1%
18486      1  < 0.1%
18445      1  < 0.1%

number_of_stops
Real number (ℝ≥0)

ZEROS

Distinct: 6
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.9828129691
Minimum: 0
Maximum: 5
Zeros: 5301
Zeros (%): 19.9%
Memory size: 208.2 KiB

Quantile statistics

Minimum: 0
5th percentile: 0
Q1: 1
Median: 1
Q3: 1
95th percentile: 2
Maximum: 5
Range: 5
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 0.6768069427
Coefficient of variation (CV): 0.6886426654
Kurtosis: 1.941358959
Mean: 0.9828129691
Median Absolute Deviation (MAD): 0
Skewness: 0.8233686836
Sum: 26190
Variance: 0.4580676376
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=6)

Value  Count  Frequency (%)
1      17470  65.6%
0       5301  19.9%
2       2979  11.2%
3        831  3.1%
4         66  0.2%
5          1  < 0.1%

Value  Count  Frequency (%)
0       5301  19.9%
1      17470  65.6%
2       2979  11.2%
3        831  3.1%
4         66  0.2%

Value  Count  Frequency (%)
5          1  < 0.1%
4         66  0.2%
3        831  3.1%
2       2979  11.2%
1      17470  65.6%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
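
The definition above can be sketched in plain Python. This is a minimal illustration rather than the routine pandas-profiling uses internally; the input values are the departure_time_day and departure_time entries from the ten first rows in the Sample section, with the departure_time codes treated as numbers (an assumption made for the example).

```python
import math


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (std_x * std_y)


# departure_time_day and departure_time from the ten first sample rows.
departure_time_day = [14, 14, 21, 7, 7, 14, 14, 0, 0, 7]
departure_time = [2, 2, 3, 1, 1, 2, 2, 0, 0, 1]

r = pearson_r(departure_time_day, departure_time)

# Shifting and rescaling one variable leaves r unchanged.
r_rescaled = pearson_r([3 * v + 10 for v in departure_time_day], departure_time)

print(round(r, 6), round(r_rescaled, 6))
```

On these ten rows the two variables are exactly proportional, so r is 1 up to floating-point error; ten rows are far too few to say anything about the full dataset.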

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
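
A minimal sketch of that recipe: rank both variables (ties receive the average of their positions), then apply the Pearson formula to the ranks. The helper names and the toy data are illustrative and not taken from the report.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Pearson correlation of the rank variables of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Toy data: a monotonic but nonlinear relationship (y = x squared).
x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]

print(round(spearman_rho(x, y), 6), round(pearson_r(x, y), 6))
```

For this toy input ρ reaches 1 while r stays below 1, which illustrates why ρ is better suited to nonlinear monotonic relationships.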

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
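
The pair-counting definition can be written out directly. The sketch below implements exactly the formula in the text (often called tau-a), where tied pairs count as neither concordant nor discordant; tie-corrected variants such as tau-b exist and give different values when ties are present. The toy data is illustrative.

```python
from itertools import combinations


def kendall_tau(x, y):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = 0
    discordant = 0
    pairs = list(combinations(range(len(x)), 2))
    for i, j in pairs:
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1   # both variables move the same way
        elif direction < 0:
            discordant += 1   # the variables move in opposite ways
    return (concordant - discordant) / len(pairs)


x = [1, 2, 3, 4, 5]
y = [3, 1, 4, 2, 5]

tau = kendall_tau(x, y)
print(tau)
```

With these inputs there are 7 concordant and 3 discordant pairs out of 10, so τ = 0.4. The quadratic loop is fine for illustration but slow on tens of thousands of rows.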

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available for it.

Missing values

Sample

First rows

       df_index  flight_path  airline  departure_time_day  booking_day  departure_day  departure_time  flight_cost  number_of_stops
    0         0            3        1                  14            0              0               2         3500                1
    1         1            3        1                  14            0              0               2         3500                1
    2         2            3        1                  21            0              0               3         3868                1
    3         3            3        2                   7            0              0               1         3500                1
    4         4            3        2                   7            0              0               1         3500                0
    5         5            3        2                  14            0              0               2         3868                0
    6         6            3        5                  14            0              0               2         3500                0
    7         7            3        5                   0            0              0               0         3500                0
    8         8            3        4                   0            0              0               0         3868                0
    9         9            3        5                   7            0              0               1         3501                0

Last rows

       df_index  flight_path  airline  departure_time_day  booking_day  departure_day  departure_time  flight_cost  number_of_stops
26638      5773            5        0                   0            6              0               0         9435                1
26639      5774            5        4                   7            6              0               1        11687                1
26640      5775            5        4                   7            6              0               1         4740                1
26641      5777            5        3                   0            6              0               0         9435                1
26642      5778            5        3                  14            6              0               2        11687                1
26643      5779            5        3                  14            6              0               2         4740                1
26644      5780            5        0                   7            6              0               1         4740                1
26645      5781            5        5                  14            6              0               2         8912                1
26646      5782            5        5                  14            6              0               2        11176                1
26647      5783            5        0                  21            6              0               3         4740                3